Conceptual


Q1

(a)

A flexible statistical learning method will perform better.

Reason: The training sample size is large, so a flexible model has the potential to fit a wide range of possible shapes of f. An inflexible method with a limited number of parameters cannot do this.

(b)

A flexible statistical learning method will perform worse.

Reason: With so few observations and so many predictors, the curse of dimensionality sets in, and a flexible model will tend to overfit.

(c)

A flexible statistical learning method will perform better.

Reason: For a highly non-linear relationship, a flexible model with fewer restrictions on its form (more degrees of freedom) can capture the shape of f more accurately.

(d)

A flexible statistical learning method will perform worse.

Reason: In general, more flexible methods have higher variance, so when the variance of the error terms is extremely high they will chase the noise in the training data.



Q2

(a)

Regression problem. Inference.

n = 500, p = 3: profit, number of employees, and industry.

(b)

Classification problem. Prediction.

n = 20, p = 13: price charged, marketing budget, competition price, and ten other variables.

(c)

Regression problem. Prediction.

n = 52 (weekly data for all of 2012), p = 3: % change in the US market, % change in the British market, and % change in the German market.



Q3

(a)

[Hand-drawn sketch: squared bias, variance, training error, test error, and Bayes (irreducible) error plotted against model flexibility.]

(b)

Squared bias decreases monotonically as flexibility increases, while variance increases monotonically. Training error decreases monotonically, because a more flexible model fits the training data ever more closely. Test error is U-shaped: it falls while the drop in bias dominates, then rises once the growth in variance takes over. The Bayes (irreducible) error is a constant horizontal line that lower-bounds the test error.

Q5

Advantages: Flexible methods can fit non-linear relationships better and decrease bias.

Disadvantages: They require estimating many parameters, tend to increase variance, and can overfit.

When we are mainly interested in prediction, we prefer to use flexible methods.

When we are interested in inference, we prefer less flexible methods, whose parameters are easier to interpret.
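As an illustration of the trade-off (a made-up simulation, not part of the exercise), the sketch below compares an inflexible linear fit with a deliberately over-flexible smoothing spline on a strongly non-linear f. Here the flexible fit's lower bias wins; raising df further, or shrinking the sample, tips the balance back toward the rigid fit as variance takes over.

set.seed(1)
x = runif(100)
y = sin(4 * x) + rnorm(100, sd = 0.3)    # non-linear truth plus noise
xt = runif(100)                          # a fresh test sample
yt = sin(4 * xt) + rnorm(100, sd = 0.3)

rigid = lm(y ~ x)                        # inflexible: straight line (high bias)
flexible = smooth.spline(x, y, df = 25)  # very flexible (higher variance)

mean((yt - predict(rigid, data.frame(x = xt)))^2)   # test MSE, linear fit
mean((yt - predict(flexible, xt)$y)^2)              # test MSE, flexible fit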



Q6

For parametric approaches, we make explicit assumptions about the functional form of f, which reduces the problem of estimating f to estimating a set of parameters. Non-parametric methods make no assumption about the form of f, but they require a large number of observations to estimate f accurately.

Advantages: It is easier to estimate a set of parameters than to fit an entirely arbitrary function.

Disadvantages: We don’t know the exact form of the true f, so the parametric form we choose may be far from f (underfitting, or overfitting if we compensate with an overly flexible form).
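A minimal sketch of the contrast on R's built-in cars data (an illustration, not from the exercise):

para = lm(dist ~ speed, data = cars)       # parametric: assume f is linear, estimate 2 coefficients
coef(para)

nonpara = loess(dist ~ speed, data = cars) # non-parametric: no assumed global form for f,
                                           # needs enough observations near each point
predict(nonpara, data.frame(speed = 15))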



Q7

(a)

The Euclidean distance between each observation and the test point \((0, 0, 0)\):

| Obs | Distance |
|-----|----------|
| 1 | \(3\) |
| 2 | \(2\) |
| 3 | \(\sqrt{10}\) |
| 4 | \(\sqrt{5}\) |
| 5 | \(\sqrt{2}\) |
| 6 | \(\sqrt{3}\) |

(b)

When K=1, observation 5 is the single nearest neighbor, so the prediction is Green.

(c)

When K=3, observations 2, 5, and 6 are the three nearest neighbors (Red, Green, Red), so the prediction is Red.
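Both predictions, and the distances in (a), can be checked with a short sketch; the observation table is hard-coded from the exercise.

X = rbind(c(0, 3, 0),    # obs 1, Red
          c(2, 0, 0),    # obs 2, Red
          c(0, 1, 3),    # obs 3, Red
          c(0, 1, 2),    # obs 4, Green
          c(-1, 0, 1),   # obs 5, Green
          c(1, 1, 1))    # obs 6, Red
Y = c("Red", "Red", "Red", "Green", "Green", "Red")

d = apply(X, 1, function(x) sqrt(sum((x - c(0, 0, 0))^2)))  # distances to the test point
round(d, 3)

Y[order(d)[1]]           # K = 1: the nearest neighbor is obs 5, so "Green"
table(Y[order(d)[1:3]])  # K = 3: obs 5, 6, 2 vote Green, Red, Red, so "Red"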

(d)

When the Bayes decision boundary is highly non-linear, a small value of K would be better.

The reason is that when K is small, the KNN decision boundary is very flexible; as K grows, the boundary becomes smoother and increasingly close to linear.
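To see this concretely, here is an illustrative simulation (requires the class package; the circular boundary and the K values are made up for demonstration):

library(class)
set.seed(1)
tr = matrix(rnorm(400), ncol = 2)
cl = factor(ifelse(tr[, 1]^2 + tr[, 2]^2 > 1, "Red", "Green"))  # non-linear (circular) truth

mean(knn(tr, tr, cl, k = 1) != cl)    # K = 1: flexible enough to fit the training data exactly
mean(knn(tr, tr, cl, k = 99) != cl)   # large K: the smoothed boundary misses the circular edge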



Applied

Q8

(a)
college=read.csv("D:/ISLR_hw/College.csv")   # adjust the path to wherever College.csv is saved
head(college)
##                              X Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University     Yes 1660   1232    721        23
## 2           Adelphi University     Yes 2186   1924    512        16
## 3               Adrian College     Yes 1428   1097    336        22
## 4          Agnes Scott College     Yes  417    349    137        60
## 5    Alaska Pacific University     Yes  193    146     55        16
## 6            Albertson College     Yes  587    479    158        38
##   Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1        52        2885         537     7440       3300   450     2200  70
## 2        29        2683        1227    12280       6450   750     1500  29
## 3        50        1036          99    11250       3750   400     1165  53
## 4        89         510          63    12960       5450   450      875  92
## 5        44         249         869     7560       4120   800     1500  76
## 6        62         678          41    13500       3335   500      675  67
##   Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1       78      18.1          12   7041        60
## 2       30      12.2          16  10527        56
## 3       66      12.9          30   8735        54
## 4       97       7.7          37  19016        59
## 5       72      11.9           2  10922        15
## 6       73       9.4          11   9727        55
(b)
rownames(college)=college[,1]
fix(college)

Now that the college names are stored as row names, the first column is redundant and can be removed:

college=college[,-1]
fix(college)
(c)
i
summary(college)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
ii
pairs(college[,1:10])

iii
boxplot(Outstate~Private, college)

iv

Elite=rep('No',nrow(college))
Elite[college$Top10perc>50]='Yes'
Elite=as.factor(Elite)
college=data.frame(college,Elite)

summary(college$Elite)
##  No Yes 
## 699  78
boxplot(Outstate~Elite,college)

v
par(mfrow=c(2,2))
hist(college$Apps)
hist(college$Accept)
hist(college$Enroll)
hist(college$PhD)

vi
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
phd_private=mean(filter(college,Private=='Yes')$PhD)  # mean % of faculty with PhDs, private schools
phd_public=mean(filter(college,Private=='No')$PhD)    # mean % of faculty with PhDs, public schools

On average, public universities have a higher percentage of faculty with Ph.D.s, which may reflect a stronger research focus.

college_1=mutate(college,Accept_rate=Accept/Apps)
Accept_elite=mean(filter(college_1,Elite=='Yes')$Accept_rate)
Accept_nonelite=mean(filter(college_1,Elite=='No')$Accept_rate)

On average, elite universities have a lower acceptance rate than non-elite ones.
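A boxplot of the same comparison, using the college_1 data frame built above:

boxplot(Accept_rate ~ Elite, data = college_1, ylab = "Acceptance rate")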



Q9

Read in the data with "?" treated as missing, then remove the rows containing missing values:

auto=read.csv("D:/ISLR_hw/Auto.csv", na.strings="?")  # "?" marks missing values in this file
auto=na.omit(auto)
(a)
head(auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

We can see that mpg, cylinders, displacement, horsepower, weight, acceleration, and year are quantitative, while origin (a coded region label) and name are qualitative.
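A quick structural check supports this; name comes in as a character (or factor, depending on the R version) column, while everything else is numeric:

str(auto)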

(b)
apply(auto,2,range)
##      mpg    cylinders displacement horsepower weight acceleration year
## [1,] " 9.0" "3"       " 68.0"      " 46"      "1613" " 8.0"       "70"
## [2,] "46.6" "8"       "455.0"      "230"      "5140" "24.8"       "82"
##      origin name                     
## [1,] "1"    "amc ambassador brougham"
## [2,] "3"    "vw rabbit custom"
# apply() coerced the whole data frame to a character matrix, which is why the
# output above is quoted; sapply over the numeric columns stays numeric:
#sapply(auto[,1:7],range)
(c)
sapply(auto[, 1:7], mean)
##          mpg    cylinders displacement   horsepower       weight 
##    23.445918     5.471939   194.411990   104.469388  2977.584184 
## acceleration         year 
##    15.541327    75.979592
sapply(auto[, 1:7], sd)
##          mpg    cylinders displacement   horsepower       weight 
##     7.805007     1.705783   104.644004    38.491160   849.402560 
## acceleration         year 
##     2.758864     3.683737
(d)
auto_new=auto[-(10:85),]   # drop observations 10 through 85

apply(auto_new,2,range)
##      mpg    cylinders displacement horsepower weight acceleration year
## [1,] "11.0" "3"       " 68"        " 46"      "1649" " 8.5"       "70"
## [2,] "46.6" "8"       "455"        "230"      "4997" "24.8"       "82"
##      origin name                     
## [1,] "1"    "amc ambassador brougham"
## [2,] "3"    "vw rabbit custom"
sapply(auto_new[, 1:7], mean)
##          mpg    cylinders displacement   horsepower       weight 
##    24.404430     5.373418   187.240506   100.721519  2935.971519 
## acceleration         year 
##    15.726899    77.145570
sapply(auto_new[, 1:7], sd)
##          mpg    cylinders displacement   horsepower       weight 
##     7.867283     1.654179    99.678367    35.708853   811.300208 
## acceleration         year 
##     2.693721     3.106217
(e)
library(ggplot2)
#library(GGally)
pairs(auto)

ggplot(auto,aes(weight,displacement))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Here we can see that displacement and weight are positively correlated.

(f)

From the plots in (e), we can see that almost all of the predictors are related to mpg; name is the only exception. The correlations below confirm this.
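A quick numeric check (origin is kept as its numeric 1/2/3 code here, so read its correlation loosely):

round(cor(auto[, 1:8])[, "mpg"], 2)   # correlation of each quantitative column with mpg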



Q10

(a)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
?Boston
## starting httpd help server ...
##  done

There are 506 rows and 14 columns.

Each row represents a suburb of Boston, and each column records one attribute of that suburb (crime rate, tax rate, and so on).

(b)
pairs(Boston)

ggplot(Boston,aes(rad,crim))+geom_point()

It can be seen that suburbs with a higher index of accessibility to radial highways (rad) tend to have higher per capita crime rates.

(c)

Yes. As shown in (b), rad is strongly associated with per capita crime rate, and the correlations below show that tax behaves similarly.
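Because every column of Boston is numeric, cor() gives a quick ranking:

round(sort(cor(Boston)["crim", ], decreasing = TRUE), 3)   # crim itself first, then rad and tax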

(d)
hist(Boston[Boston$crim>1,]$crim, breaks=25)   # restrict to crim > 1 to show the tail

hist(Boston$tax, breaks=25)

hist(Boston$ptratio, breaks=25)

For crime rate, most towns have low rates, but the distribution has a long right tail: some suburbs have crime rates over 20, reaching above 80.

For tax, there is a large gap between low-tax and high-tax towns, with a pronounced cluster of towns around a rate of 666.

For the pupil-teacher ratio, the distribution is skewed toward lower values, with a large cluster of towns around a ratio of 20.
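These tails can be counted directly; the thresholds below are read off the histograms by eye, so treat them as illustrative:

sum(Boston$crim > 20)      # suburbs in the extreme right tail of the crime distribution
sum(Boston$tax >= 600)     # the isolated high-tax cluster
sum(Boston$ptratio >= 20)  # towns in the cluster of high pupil-teacher ratios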

(e)
nrow(Boston[Boston$chas==1,])
## [1] 35

There are 35 suburbs that bound the Charles River.

(f)
median(Boston$ptratio)
## [1] 19.05

The median pupil-teacher ratio is 19.05.

(g)
subset(Boston,medv==min(Boston$medv))
##        crim zn indus chas   nox    rm age    dis rad tax ptratio  black
## 399 38.3518  0  18.1    0 0.693 5.453 100 1.4896  24 666    20.2 396.90
## 406 67.9208  0  18.1    0 0.693 5.683 100 1.4254  24 666    20.2 384.97
##     lstat medv
## 399 30.59    5
## 406 22.98    5

The 399th and 406th suburbs have the lowest median value of owner-occupied homes (medv = 5). Comparing their other predictors with the quartiles from summary(Boston) in (h), both sit at the unfavorable extremes: very high crim, age = 100, rad = 24, tax = 666, and ptratio = 20.2.

(h)
dim(subset(Boston,rm>7))[1]
## [1] 64
dim(subset(Boston,rm>8))[1]
## [1] 13

There are 64 suburbs that average more than 7 rooms per dwelling, and 13 that average more than 8.

summary(subset(Boston,rm>8))
##       crim               zn            indus             chas       
##  Min.   :0.02009   Min.   : 0.00   Min.   : 2.680   Min.   :0.0000  
##  1st Qu.:0.33147   1st Qu.: 0.00   1st Qu.: 3.970   1st Qu.:0.0000  
##  Median :0.52014   Median : 0.00   Median : 6.200   Median :0.0000  
##  Mean   :0.71879   Mean   :13.62   Mean   : 7.078   Mean   :0.1538  
##  3rd Qu.:0.57834   3rd Qu.:20.00   3rd Qu.: 6.200   3rd Qu.:0.0000  
##  Max.   :3.47428   Max.   :95.00   Max.   :19.580   Max.   :1.0000  
##       nox               rm             age             dis       
##  Min.   :0.4161   Min.   :8.034   Min.   : 8.40   Min.   :1.801  
##  1st Qu.:0.5040   1st Qu.:8.247   1st Qu.:70.40   1st Qu.:2.288  
##  Median :0.5070   Median :8.297   Median :78.30   Median :2.894  
##  Mean   :0.5392   Mean   :8.349   Mean   :71.54   Mean   :3.430  
##  3rd Qu.:0.6050   3rd Qu.:8.398   3rd Qu.:86.50   3rd Qu.:3.652  
##  Max.   :0.7180   Max.   :8.780   Max.   :93.90   Max.   :8.907  
##       rad              tax           ptratio          black      
##  Min.   : 2.000   Min.   :224.0   Min.   :13.00   Min.   :354.6  
##  1st Qu.: 5.000   1st Qu.:264.0   1st Qu.:14.70   1st Qu.:384.5  
##  Median : 7.000   Median :307.0   Median :17.40   Median :386.9  
##  Mean   : 7.462   Mean   :325.1   Mean   :16.36   Mean   :385.2  
##  3rd Qu.: 8.000   3rd Qu.:307.0   3rd Qu.:17.40   3rd Qu.:389.7  
##  Max.   :24.000   Max.   :666.0   Max.   :20.20   Max.   :396.9  
##      lstat           medv     
##  Min.   :2.47   Min.   :21.9  
##  1st Qu.:3.32   1st Qu.:41.7  
##  Median :4.14   Median :48.3  
##  Mean   :4.31   Mean   :44.2  
##  3rd Qu.:5.12   3rd Qu.:50.0  
##  Max.   :7.44   Max.   :50.0
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Comparing the two summaries, suburbs that average more than eight rooms per dwelling have far lower crime rates (mean crim 0.72 vs. 3.61 overall), lower lstat (4.31 vs. 12.65), and higher medv (44.2 vs. 22.53).